ABSTRACT: This paper details the anticipated impact of synthetic "big" data on learning analytics
(LA) infrastructures, with a particular focus on data governance, the acceleration of service 
development, and the benchmarking of predictive models. By reviewing two cases, one at the 
sector-wide level (the Jisc learning analytics architecture) and the other at the institutional level 
(the UvAInform learning analytics project at the University of Amsterdam), we explore the need 
for an on-demand tool for generating a wide range of synthetic data. We argue that the application 
of synthetic data will not only accelerate the creation of complex and layered learning analytics 
infrastructure, but will also help to address the ethical and privacy risks involved during service 
development.

1 INTRODUCTION


There is growing interest in deploying learning analytics services at educational institutions. This interest
is stimulated by a number of successful examples of services
that have affected student learning. Of these, Course Signals is arguably the best known (Arnold & Pistilli,
2012). Another example is the Open Academic Analytics Initiative (OAAI) led by Marist College 
(Jayaprakash, Moody, Lauria, Regan, & Baron, 2014). Building on early work in LA, Siemens et al. (2011) 
proposed developing an overarching framework for learning analytics. An all-encompassing framework 
would need to include the following: 1) the collection of data, 2) dealing with crucial issues such as data 
governance and ethics, 3) pre-processing of the data, 4) sharing of the data models, 5) predictive 
modelling, 6) interventions including dashboards and other strategies, and the measurement of their 
impact on the learning process. The conversation about this open learning analytics framework is ongoing 
and influencing the design of major learning analytics services such as Jisc's Open Learning Analytics 
Architecture (Sclater, Berg, & Webb, 2015) and the Apereo (2015a) Learning Analytics Initiative. These 
frameworks have many interrelated components, and they digest a rich variety of data. In this paper, we
explore the roles that synthetic data, and the software that generates it, can play in
helping to develop these emerging Big Data learning analytics service infrastructures. 


Through the mechanism of a systematic literature review, we explore whether synthetic data approaches 
have been fully utilized in general, and specifically in the field of learning analytics. Are there already 
significant examples of synthetic data generation and usage whose methodologies are ready to apply 
within the field of learning analytics? Can we argue for a common unifying approach to the generation of 
synthetic data specific to learning analytics through the means of a reference synthetic data generator? 

2 LITERATURE REVIEW 


2.1 General Usage of Synthetic Data 

Synthetic data is primarily used to avoid accidental disclosure or reconstruction of information; for 
example, as part of national microdata sets (Kinney et al., 2011). There are numerous methods to limit
the risk (Matthews & Harel, 2011) such as using example data, fitting predictive models with the example 
data, and then generating replacement data from the tuned model. Synthetic data enables the rapid 
prototyping of services before the "real" big data has been amassed or made available to an application. 
Its availability supports proof of concept, security testing, practising, and training around data governance 
processes, boundary testing, user testing of visualizations, and interoperability testing of different 
architectural components, as well as many other applications. 
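
The fit-then-generate approach mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each record holds two numeric fields (invented weekly logins and a final grade) and that a bivariate normal model is adequate. The function names `fit_model` and `generate` are hypothetical, and real microdata synthesizers rely on far richer models.

```python
import math
import random
import statistics

def fit_model(records):
    """Estimate means, standard deviations, and correlation of two numeric fields."""
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in records) / (len(records) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def generate(model, n, seed=0):
    """Draw n replacement records from the fitted bivariate normal model."""
    rng = random.Random(seed)
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1 - rho * rho))  # guard against rounding
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        out.append((x, y))
    return out

# Invented example data: (weekly logins, final grade).
example = [(2, 5.1), (4, 6.0), (5, 6.4), (7, 7.2), (9, 8.1), (10, 8.0)]
model = fit_model(example)
synthetic = generate(model, 1000, seed=42)
```

Only the fitted parameters, not the example records, are needed to produce the replacement data, which is what limits the risk of reconstructing an individual record.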

Synthetic data, also known as simulated data, has been heavily researched and successfully applied across 
a broad range of scientific fields, including economic calculations as part of national micro-datasets 
(Kinney, Reiter, & Miranda, 2014); house occupancy for urban planning; transportation planning 
(Beckman, Baggerly, & McKay, 1996; Rich & Mulalic, 2012); deterioration of sewage systems (Scheidegger 
& Maurer, 2012); support of fraud detection systems (Barse, Kvarnstrom, & Jonsson, 2003); security 
testing of defense in-depth strategies (Boggs, Zhao, Du, & Stolfo, 2014); workload generation for cloud 
computing (Bahga & Madisetti, 2011); simulating real time network traffic (Botta, Dainotti, & Pescape, 
2012); weather behaviour, such as precipitation (Abtew, Moras, & Campbell, 1990; Piantadosi, Boland, & 
Howlett, 2009) and wind (Liang et al., 2013); the number of solar-power cells delivered in a year for a 
given location (Celik, 2003); and for realistic workload generation for YouTube (Abhari & Soraya, 2010). 
Within the field of bioinformatics, synthetic data has been used for the design and analysis of
structure-learning algorithms (Van den Bulcke et al., 2006).

In the field of data mining, synthetic data has been used to generate and benchmark text-mining 
algorithms and tools (Eno & Thompson, 2008; Jeske, Lin, Rendon, Xiao, & Samadi, 2006); for building and 
testing Information Discovery Systems (Lin et al., 2006); selecting feature set discovery algorithms
(Bolon-Canedo, Sanchez-Marono, & Alonso-Betanzos, 2013); testing the scalability of big data infrastructures, for
example by populating and testing the performance of databases of various types (Gray, Sundaresan,
Englert, Baclawski, & Weinberger, 1994; Tzouramanis, Vassilakopoulos, & Manolopoulos, 2002; Lo, Cheng,
Lin, Hon, & Choi, 2014); for evaluating visual-analytics techniques (Maciejewski et al., 2009); for the
generation and analysis of social networks (Barrett et al., 2009); and to create training datasets for handwriting
recognition (Varga & Bunke, 2008).

ISSN 1929-7750 (online). The Journal of Learning Analytics works under a Creative Commons License, Attribution - NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)

(2016). The role of a reference synthetic data generator within the field of learning analytics. Journal of Learning Analytics, 3(1), 107-128.
http://dx.doi.org/10.18608/jla.2016.31.7


A recent review of learning analytics in UK higher and further education suggests that the emerging 
market for learning analytics products is highly fragmented (Sclater, 2014). Therefore, a great challenge 
for institutions is the risk of vendors developing and marketing similar systems that tackle different parts 
of the learning analytics infrastructure, but have not been made interoperable. Within this context, 
synthetic data has the potential to accelerate the development of big learning analytics infrastructure and
methods, and to avoid unnecessary delays, through the early release of realistically distributed, descriptive data that
carries minimal risk of accidental disclosure (Matthews & Harel, 2011). The data can form the
basis of benchmarks because both the data and the systems developed to generate them can be shared freely as
part of the benchmark. One focus of such benchmarks will be to support decision makers in choosing
between a series of similarly visually appealing products. However, it should be noted that the challenge
of bias in the generated data could lead to poor decision making. Consider the problem of class imbalance 
and the need to oversample minority populations (He, Bai, Garcia, & Li, 2008). Clearly labelling the degree 
of bias of the benchmarks in order to assist decision makers will be a challenge. 
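
To make the labelling problem concrete, the following is a minimal sketch of plain random oversampling, which is a deliberately simpler technique than the one in the cited work. The cohort, the field names, and the `oversample` helper are all hypothetical. The point is that the class counts before and after resampling can be recorded and shipped with the benchmark as an explicit label of the induced bias.

```python
import random
from collections import Counter

def oversample(records, label_key, seed=0):
    """Randomly duplicate minority-class records until every class matches the largest.

    Returns the balanced records and a report of class counts before and after,
    which could accompany a benchmark as a label of the induced bias.
    """
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Top up smaller classes with randomly chosen duplicates.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    report = {
        "before": dict(Counter(r[label_key] for r in records)),
        "after": dict(Counter(r[label_key] for r in balanced)),
    }
    return balanced, report

# Invented cohort: 95 students who pass and 5 who drop out.
cohort = [{"id": i, "outcome": "pass"} for i in range(95)]
cohort += [{"id": 95 + i, "outcome": "dropout"} for i in range(5)]
balanced, report = oversample(cohort, "outcome")
```

A decision maker comparing products on `balanced` would then see from `report` that the dropout class was inflated from 5 to 95 records.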


Another grand challenge for large organizations is to centralize data and, by implication, their governance 
(Ebner, Taraghi, Sarantie & Schon, 2015). This centralization allows universities to analyze a wider range 
of datasets for a broader audience with the support of central data governance to deploy learning 
analytics services across departmental boundaries. There is a risk of an emerging divide in the quality of 
these services between those organizations that strive for data centralism and those that do not (Berg, 
2015). This divide has previously been reported from within the business context with suggestions for 
accelerating progress through business culture transformation, centralization of data, and the use of 
standards (Kiron, Shockley, Kruschwitz, Finch, & Haydock, 2011). 


A survey on the subject of data quality management for big data analytics (Kwon, Lee, & Shin, 2014) 
discovered a positive relationship between a firm's competence in maintaining quality (i.e., consistency 
and completeness) and the firm's adoption intention for big data analytics. Synthetic data can be used to 
either replace missing data (completeness) or support the disambiguation process (consistency). For 
example, when using a broad range of social media as part of a learner's experience, there is a risk of 
students using multiple credentials. We might name ourselves jdberg1892 for our Twitter account and
john.doe.berg.1 for our LinkedIn account. Synthetic data has been applied in the development of
disambiguation methodologies to define strategies to resolve this issue (Ferreira, Gonsalves, Almeida, 
Laender, & Veloso, 2012). 
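
As an illustration of how synthetic data supports this kind of work, the sketch below links invented account handles using a naive string-similarity rule and scores the result against a known ground truth. The handles, the threshold, and the functions `normalise`, `similarity`, and `link_accounts` are assumptions made for this example; it is not the method of the cited study.

```python
import difflib
import re

def normalise(handle):
    """Lower-case a handle and strip digits and punctuation."""
    return re.sub(r"[^a-z]", "", handle.lower())

def similarity(a, b):
    """Return a 0..1 similarity score between two account handles."""
    return difflib.SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def link_accounts(twitter_handles, linkedin_handles, threshold=0.5):
    """Pair each Twitter handle with its most similar LinkedIn handle, if similar enough."""
    links = {}
    for handle in twitter_handles:
        best = max(linkedin_handles, key=lambda cand: similarity(handle, cand))
        if similarity(handle, best) >= threshold:
            links[handle] = best
    return links

# Synthetic ground truth: the same invented learner under two credentials.
truth = {
    "jdberg1892": "john.doe.berg.1",
    "asmith77": "anna.smith",
    "mkhan_uva": "m.khan.42",
}
links = link_accounts(list(truth), list(truth.values()))
accuracy = sum(links.get(t) == l for t, l in truth.items()) / len(truth)
```

Because the true pairings are generated rather than observed, a linking strategy can be scored without exposing any real learner's accounts.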

2.2 Usage within the Field of Learning Analytics


Ferguson (2012) noted that one of the challenges for learning analytics is to develop methods of working 
with a wide range of datasets in order to optimize learning environments. We argue that synthetic data 
supports the creation and refinement of processes prior to the data from multiple silos being freely and 
fully available. For example, it can be generated and utilized while waiting for approval from multiple 
ethics boards or working through politically sensitive data ownership issues. Synthetic data will also 
support researchers who do not have access to rich data sources, allowing them to tune and tweak their 
methodologies so that they can interact efficiently with "elite" researchers in more advantageous,
data-centralized environments.


There is a close relationship between the Educational Data Mining and Learning Analytics communities
(Siemens & Baker, 2012); many methodologies and practices are shared between them. As evidenced in
the last section, synthetic data generators are already applied in many data-mining contexts. A concrete 
example is the application of synthetic data to sparse probit-factor analysis to test the efficacy of 
estimating a learner's knowledge of the concepts within specific problem domains (Waters, Lan, & Studer, 
2013). 

There is also evidence of the use of synthetic data as part of the process of disseminating and practising 
learning analytics methodologies. For example, this occurred at a data manipulation hackathon 
(University of Michigan, 2015a), and is part of the training materials within a learning analytics MOOC 
(University of Michigan, 2015b; Koester, 2015). 

The EP4LA Ethics and Privacy Workshop Series (Sherlock, 2014) is a set of interrelated workshops 
discussing a broad range of issues including, but not limited to, data ownership, data degradation,
anonymization of data, data security, data sharing, danger of linking datasets for privacy, context integrity, 
approaches to informed consent in the times of big data, expected changes to privacy due to big data, 
cross-cultural studies on privacy, transparency (purpose of analysis, raw data access, opt-out), and ethical 
considerations for learning analytics. As discussed in the introduction, synthetic data will play an 
alleviating role for issues across these themes. 

Verbert, Manouselis, Drachsler, and Duval (2012) applied a framework mapping the high-level properties 
of datasets against their LA objectives. Through this utilitarian optic, the authors reviewed a range of 
datasets and their relevance for application within the field of learning analytics. They noted, "our 
endeavors to collect and share datasets for research remain quite challenging" (p. 145) and described how 
a number of datasets were made open. By modelling closed datasets, synthetic data generation can 
extend the range of open datasets available for characterization and experimentation. 

The Apereo Learning Analytics Initiative (LAI) is applying synthetic data for performance testing its 
reference learning analytics infrastructure and the test plans (Apereo, 2015b). The data was also used to
populate a Learning Record Store, a secure central repository for learners' activity streams, with example
data to allow data scientists to experiment with learning analytics related visualizations while building
dashboards (SoLAR, 2015). There is currently a discussion within the Apereo LAI community on the subject of
extending the test plans to reflect emerging practices around xAPI recipes, for example by deriving tests
based on recipes expressed in the connected learning analytics toolkit (Kitto, 2015).
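
Populating a Learning Record Store with example activity can be sketched as follows. The statements follow the general actor, verb, object shape of xAPI and use verb identifiers from the ADL vocabulary. The student names, mailboxes, activity identifiers, and helper functions are invented for illustration, and the sketch is not the generator used by the Apereo LAI.

```python
import json
import random
import uuid
from datetime import datetime, timedelta, timezone

# Verb IRIs from the ADL vocabulary; actors and activity IRIs are invented.
VERBS = {
    "attempted": "http://adlnet.gov/expapi/verbs/attempted",
    "completed": "http://adlnet.gov/expapi/verbs/completed",
    "commented": "http://adlnet.gov/expapi/verbs/commented",
}

def synthetic_statement(rng, student_no, start):
    """Build one xAPI-style statement for a fictitious student."""
    verb = rng.choice(sorted(VERBS))
    when = start + timedelta(minutes=rng.randrange(60 * 24 * 7))
    return {
        "id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "actor": {
            "objectType": "Agent",
            "name": f"Synthetic Student {student_no}",
            "mbox": f"mailto:student{student_no}@example.org",
        },
        "verb": {"id": VERBS[verb], "display": {"en-US": verb}},
        "object": {
            "objectType": "Activity",
            "id": f"http://example.org/activities/quiz-{rng.randint(1, 5)}",
        },
        "timestamp": when.isoformat(),
    }

def generate_statements(n_students, per_student, seed=0):
    """Generate a reproducible list of statements for a small fictitious cohort."""
    rng = random.Random(seed)
    start = datetime(2015, 9, 1, tzinfo=timezone.utc)
    return [
        synthetic_statement(rng, s, start)
        for s in range(1, n_students + 1)
        for _ in range(per_student)
    ]

statements = generate_statements(n_students=3, per_student=4)
payload = json.dumps(statements)  # JSON body that could be submitted to an LRS
```

Seeding the generator keeps the data reproducible, so a test plan derived from a recipe can be rerun against the same statements.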

Synthetic data generation has the potential to support large-scale, complex, and thus "big" learning 
analytics services such as a layered set of national or institutional services. In the next sections, we reflect 
on the opportunities for the application of synthetic data to big services. First we reference Jisc's Open 
Learning Analytics Architecture (Sclater, 2015b), which has been designed to allow universities and 
colleges in the UK to engage with learning analytics using a freely provided hosting service. Next, we look 
at the institutional level via the UvAInform project at the University of Amsterdam (Kismihok & Mol, 2014). 
Here a coordinated set of pilots is being carried out to develop a wider understanding of the value of 
learning analytics services within the university. The analysis of these two endeavours is followed by a 
review of the trend of increased sharing and richness of learning activity data outside the control of 
learner-centric organizations. We examine the implications and discuss the need for a reference 
implementation of a synthetic data generator. 

3 THE JISC OPEN LEARNING ANALYTICS ARCHITECTURE 

In response to requests for the provision of basic services to help institutions adopt learning analytics in 
the UK higher and further education sectors (Sclater, 2014), Jisc has developed an open learning analytics 
framework and is commissioning associated software components from a range of vendors (Sclater, 
2015b). In summary, data sources — initially primarily from the virtual learning environment and the 
student information system — are extracted into a "learning records warehouse" which contains both 
unstructured and structured data, including learning records in the xAPI format (Tin Can¹). Furthermore,
there may also be "self-declared" data from students, such as e-portfolio content or data from wearable 
devices. 

A learning analytics processor carries out the predictive analytics and provides the results to staff 
dashboards. A student app enables students to view their own analytics, set targets for learning, log their 
learning activities, and compare their engagement and attainment with others. Meanwhile, an
analytics-based alert and intervention system prompts staff and students in the case of certain specific situations,
such as a student's engagement signalling that they are at risk of dropout. This system also helps to 
manage any subsequent interventions with students. Students are also given a degree of control over 
what is done with their data by means of a student consent service. Note that the dashboard and app are 
relatively unintelligent, which allows different visualization tools to be slotted in. The (potentially quite 
complex) processes of managing alerts and interventions take place in the alerts and intervention system. 


¹ http://tincanapi.com/

Figure 1: An overview of Jisc's learning analytics architecture.


A number of issues arise when considering the use of synthetic data within this architecture: 


Security testing: The complexity of the various systems involved and the "big" data that they create 
(including self-declared data) suggest that a wide range of synthetic data will be required in order to carry 
out security testing prior to real data being entrusted to the infrastructure or its components. 


Interoperability testing: A variety of modular systems from different vendors is being commissioned at 
different levels of the architecture in order to provide a cohesive overall learning analytics service for 
institutions. Each one of these systems could potentially (at any point in the lifecycle of the open learning 
analytics framework) be replaced by one from a different vendor. Thus, a core set of synthetic data is 
essential in order to ensure that data can pass interoperably through the different levels of the framework 
— so that alternative tools at each level can be tested quickly and effectively. The use of real data in a 
development or acceptance environment involves a significantly enhanced risk of unintended disclosure. 
This is because the lower quality of alpha and beta software and the number of actors involved in these 
non-production environments polynomially enhance the opportunities for attacks (due to the increased 
number of viable combinations of interactions with the system) compared to the more stable and locked 
down production environments.

An initial dataset of learning activity data for interoperability testing is being generated from a set of
Moodle courses developed to explain aspects of the architecture. While this is "real" data provided 
automatically by users of the courses, it provides a useful basis for the generation of a larger scale 
synthetic dataset. 

Benchmarking for predictive models: In order to compare and contrast different predictive models, a set 
of uniform benchmarks will be required. The benchmarks do not just include ways of comparing, but 
should also include example datasets or methods of generating realistic datasets on demand. Synthetic 
data enables those without access to full and "rich" datasets to compare their services to those where 
they are available. Synthetic datasets avoid concerns of disclosure or partial coverage. 
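
A benchmark that generates its dataset on demand might look like the following minimal sketch. The data model (pass probability rising with weekly logins), the seed, and the two candidate models are all invented for illustration. The only point being made is that every candidate is scored on identical, reproducible data.

```python
import random

def make_benchmark(n, seed):
    """Generate (weekly_logins, passed) pairs; pass probability rises with logins."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        logins = rng.randint(0, 20)
        passed = rng.random() < 0.1 + 0.04 * logins
        data.append((logins, passed))
    return data

def accuracy(model, data):
    """Share of records for which the model predicts the recorded outcome."""
    return sum(model(x) == y for x, y in data) / len(data)

# Two hypothetical candidate models to compare on identical data.
def always_pass(logins):
    return True

def threshold_model(logins):
    return logins >= 10

benchmark = make_benchmark(n=2000, seed=2016)
scores = {m.__name__: accuracy(m, benchmark) for m in (always_pass, threshold_model)}
```

Publishing the generator and the seed, rather than a data file, lets any party regenerate the same benchmark without access to real student records.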

Ethical and legal compliance: Learning analytics systems need to be tested with cohort data, either real
or synthetic. Testing may be across different institutions using the products of multiple vendors. The key 
ethical and legal issues arising in the literature around learning analytics are summarized by Sclater 
(2015a) and addressed in Jisc's Code of Practice for Learning Analytics (Sclater & Bailey, 2015). Using 
synthetic data can help to avoid ethical and legal issues, in particular breaching the privacy of "real" users 
and the need for institutions to adhere to strict data protection regulations. European legislation, for 
example, prohibits the transfer of personally identifiable data outside the European Economic Area except 
in strictly controlled circumstances; the use of synthetic data means that researchers can collaborate 
internationally without needing to be concerned about breaching such laws. 

Staff training: A paper reporting the experiences from the deployment of analytics services noted that
"the initiative was hamstrung by a lack of availability of data management experts who could devote
the amount of time necessary to produce and disseminate the datasets in a form that the researchers
could use on an ongoing basis" (Buerck & Srikanth, 2014, p. 133). Staff need training to interact with and maintain
complex systems. Applying synthetic data to the systems again avoids privacy issues and
allows data to be created that model the full range of outcomes, some of which may not yet have been 
created by "real" students. For example, the full ontology and the recipes describing new types of 
interactions have yet to be defined for learning management systems (Kitto, Cross, Waters, & Lupton, 
2015). The synthetic data themselves can be considered self-descriptive, giving the trainees valuable 
context information during simulations of their working environment. 

Additional services: New big data services will emerge. A tried and tested set of synthetic data will enable 
these to be tested alongside existing services. 

4 UvAINFORM 

The University of Amsterdam initiated the UvAInform project in 2013 (Kismihok & Mol, 2014) in order to
strategically coordinate institution-wide learning analytics services. The project has evolved from one that
initially took a centralized approach to the development and implementation of these services. It is now
more devolved, involving seven different pilot projects across the various faculties of the university. The
objective is to gain experience, learn lessons, and develop expertise across the university. Furthermore, 
the project initiated the development of an open source Learning Record Store (LRS) to collect student 
activity (Apereo, 2015c), which in combination with a data warehouse and an open source Extract 
Transform and Load layer (Roldan, 2015) aimed to unlock the large number of data silos within the 
university, many of which were never developed specifically for learning analytics purposes. 


Given the data-driven nature of this endeavor (at least for the time being), the UvAInform pilot project
leaders are not always able to fully articulate their intentions and/or desires. With no firm policy 
framework in place to guide and direct data governance, the vision of a fully data saturated LRS remains 
elusive. Budgetary and political constraints meant that instead of developing an overall strategy for a 
university-wide learning analytics framework, a less ambitious approach needed to be taken. This 
approach entailed having seven faculty-level pilots, which set out their requirements regarding connecting 
specific data sources to the LRS. 

During the initial stages of the UvAInform project, 61 different information systems (IS) that use and store
education-related data were identified (Kismihok & Mol, 2014). Some of these systems are core
elements of educational activities (such as the university's Learning Management System); some are 
minor software targeting a specific educational or administrative aim (e.g., faculty level thesis 
administration). Some of them are well integrated, but most systems exist as "islands" or data silos, 
without communicating with any other IS. Furthermore, silo gatekeepers are understandably wary of 
granting access to "outsiders" to "their" data sources (Kismihok & Mol, 2014). 


The LRS was populated with activity data from the university's LMS (Blackboard) combined with data from 
the Student Information System (SIS) and the timetabling system. The range of data is expected to 
increase rapidly and to include more sources centred on the group activities found in flipped classrooms, 
especially video clips and forums. Even though the current pilots only use these three data sources, a 
number of UvAInform project members were facing challenges, including:


• How to transmit large amounts of data from the three sources to the LRS 

• How to transmit large amounts of data from the LRS to the dashboards associated with the pilot 
projects 

• On what basis should partners set the technical requirements of data management for the seven pilot 
projects? Should the infrastructure be centralized or decentralized? The pilots by their nature are 
short term and use relatively few resources but if successful may need to be generalized or scaled-up 
quickly. 

• Testing the scalability of both the LRS and the pilot systems in terms of data processing and data 
management 

• Lack of experience within the organization about data delivery. How to share the data with authorized 
users ethically and technically? How should the university govern such authorization?

• A lack of empowerment and influence on the part of key UvAInform stakeholders to evangelize and
facilitate the cultural change associated with this challenging data-governance issue 


A communication tool such as the learning analytics readiness instrument (LARI), a survey to measure
institutional readiness, would have helped us to understand and communicate where to focus our efforts
(Arnold, Lonn, & Pistilli, 2014). However, the UvAInform pilots were a necessary precursor to developing 
an institutional culture of greater data-driven decision making. 

4.1 Transparency of Educational Data Management 

A study revealed that 22 internal stakeholder groups have an interest in the UvAInform project (Szorenyi 
& Kismihok, 2015). This puts management in a difficult position since it is close to impossible to meet the 
requirements of all stakeholder groups. With few exceptions, all of these groups have claims on 
educational data. They are either data creators (students, teachers), data managers (technical support, IS 
management), or policy and decision makers (legal, ethical boards, and management bodies that use 
educational data for their decisions). The majority of these stakeholders face issues with overall 
educational data management, such as: 


• Overseeing the data management processes within the organization 

• Obtaining a clear picture of precisely what individual-level data is being recorded 

• Knowing what is happening with the individual data (which IS at the university uses what data and 
how) 

• Finding the barrier between the data the university is responsible for and the data that does not fall 
under its authority (for instance social media data, data generated by mobile devices, or data 
mirroring labour market information in a student goal-setting application; see Kobayashi, Mol, & 
Kismihok, 2015) 

• Deciding under what circumstances data can leave the premises of the university. Ongoing research 
at the University of Amsterdam has revealed, for instance, that students have little idea about how 
their educational data is being managed by IS vendors and the government (Stuurman, 2015). 

4.2 Empowerment of Learning Analytics Research 

According to the lessons we learnt during the UvAInform pilots, learning analytics researchers and 
teachers involved in experiments around learning analytics have limited possibilities to pilot their software 
and algorithm prototypes. Lack of access to relevant data sources, due to the aforementioned 
characteristics of the local information architecture and its decision-making loops, can impede the 
progress of research. There is a clearly articulated need for a "data sandbox" that accurately models the 
data structures and data types of the various ISs in the organization. Breaking down data silos takes time. 
However, synthetic data will allow researchers and affiliated technical staff to build the services before 
the political, ethical, legal, and data-cleaning issues have been resolved, or even before it is clear what data exists.
This allows research to work in parallel with those processes, significantly decreasing the time to delivery
of services to the target audience. 


To summarize, we believe that two key lessons may be drawn from the UvAInform Learning Analytics 
Program. First, data centralism is key to developing a learning analytics framework and facilitating the 
development of learning analytics services at the university level. Second, a central body in the 
organization, such as the Centre for Data Governance and Innovation, could serve as a hub for facilitating 
the discourse and innovation around the ethical and privacy concerns raised by creating large-scale 
learning analytics frameworks. The Centre is a natural part of the organization to curate the synthetic data 
and benchmarks. 


Although not ideal, the risk of an emerging digital divide between researchers with data centralism and those
without is diminished, as methodologies can be tested with synthetic data and used to cross-validate
learning analytics projects.

5 CROSS-INSTITUTIONAL ADOPTION OF LA

The previous two sections have explored the value of synthetic data within real world situations. The 
UvAInform project took place in a typical university that wanted to research the requirements and impact 
of learning analytics. All the data used comes from within the university. Meanwhile the Jisc learning 
analytics architecture is a prototype for regional or national services. Here the activity data comes 
primarily from the participating organizations that consume the services, and remains under each 
institution's control. It is not yet possible to quantify the amount of self-declared data that will be provided 
by students using the student app or other input mechanisms. The volume and complexity of this "big" 
self-declared data will increase as the service matures and the service providers explore new ways of 
utilizing it. 

A further theme to explore is the trend towards the use of learning activity data outside organizational 
boundaries. Online learning occurs, of course, not just within the organization's systems, but can take 
place within a wide variety of social media platforms and other web-based systems. This has an impact 
on the availability and quality of learning activity data, and on the increasing richness of the data that learning
analytics services can utilize. It implies that the synthetic data generator needs to be flexible and cover an 
ever-increasing set of rich data sources. This section details the pressures, and briefly examines synthetic 
data's role for these sources. 

Learning activity data provides challenges for a university's data governance processes. One of these is 
that students and teachers are regularly engaged in learning activities outside the sphere of control of the 
educational body or regional services in which they are embedded. There is an incremental loss of access 
to data caused by the increasing number of globalized services (such as MOOCs, Google Docs and Twitter) 
used. This lack of control over the data by the institution may increase the legal and ethical risks for data
subjects, particularly students.


The globalization of services is due to a number of trends, which include: 


Pedagogical practices centred on blended learning and the flipped classroom (Bishop & Verleger, 2013). 
These are actively engaging groups of students with social media. The application of services such as 
Facebook (Junco, 2012; Ahn, 2013), Twitter (Junco, Heiberger, & Loken, 2011), YouTube (Ammari, Lau, & 
Dimitrova, 2012), and other social media has been shown to improve student engagement. Although 
there are pitfalls, such as the quality of the content provided (Duncan, Yarwood-Ross, & Haigh, 2013), 
these engagement tactics imply that a considerable percentage of the activity data has escaped central 
control. A natural consequence is that it might be feasible to collect some of the data for some of the 
students, but not the full range. Learning analytics services will need to replace some of the missing data 
to optimize the value of the collected data. The impact of example replacement methods to fill in the gaps 
is discussed by Farhangfar, Kurgan, and Dy (2008). 
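
Synthetic data also makes it possible to measure how well a gap-filling method works, because the complete data is known. The sketch below masks part of an invented activity series and fills the gaps with the mean of the observed values. Mean imputation is chosen here only for brevity and is an assumption of this example, as are the helper names `mask` and `impute_mean`.

```python
import random
import statistics

def mask(values, share, seed=0):
    """Replace roughly `share` of the values with None to mimic uncollected activity."""
    rng = random.Random(seed)
    return [None if rng.random() < share else v for v in values]

def impute_mean(values):
    """Fill each gap with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Synthetic "complete" data: forum posts per student (invented distribution).
rng = random.Random(1)
complete = [rng.randint(0, 30) for _ in range(500)]
with_gaps = mask(complete, share=0.3)
filled = impute_mean(with_gaps)

# Because the complete data is known, the imputation error can be measured.
gaps = [i for i, v in enumerate(with_gaps) if v is None]
mean_abs_error = statistics.mean(abs(filled[i] - complete[i]) for i in gaps)
```

The same harness could be used to compare alternative replacement methods on identical gaps before any of them is applied to real student data.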

Traditional LMSs have tended to fully integrate the majority of their functionality such as wikis, forums, 
chat, polls, and resource areas within one application. The higher education sector is moving away from 
the concept of a monolithic LMS where all the services are contained in one application to a thinner LMS 
that orchestrates and enhances learning partially through a series of external tools fulfilling specific 
functionality (Dagger, O'Connor, Lawless, Walsh, & Wade, 2007). In general, the trend is towards thinner 
LMSs orchestrating a collection of third-party services. The design practice supports scalability and eases 
the effort to migrate and support third-party specialization. IMS Global's Learning Tools Interoperability 
(LTI) protocol allows a standalone application to appear to be working within different LMSs. The number 
of tools mentioned on the LTI conformance page (IMS Global, 2015) evidences the popularity of this 
approach. The Caliper sensor API 2 builds on this approach and allows for the collection of activity data in 
a standard format from a range of systems. IMS Global is working on an LTI compatible extension to track 
activity with a standardized ontology. The authors expect there to be a data quality divide between 
applications that apply learner activity standards, such as Caliper and xAPI (Kevan & Ryan, 2015; ADL, 
2015), and non-standards-based applications. A synthetic data generator with generic capabilities to 
generate output for these standards will by default cover a wide and increasing range of compatible tools. 
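
As a rough illustration of the kind of output such a generator might produce, the following Python sketch emits synthetic xAPI-style statements (actor, verb, object, timestamp). The student names, the example.org addresses and activity IRIs, the choice of three ADL verbs, the one-week time window, and the function names are all assumptions made for this example. It is a minimal sketch and not a complete or validated xAPI implementation.

```python
import json
import random
import uuid
from datetime import datetime, timedelta, timezone

# A small, illustrative selection of verb IRIs from the ADL vocabulary.
VERBS = {
    "attempted": "http://adlnet.gov/expapi/verbs/attempted",
    "completed": "http://adlnet.gov/expapi/verbs/completed",
    "experienced": "http://adlnet.gov/expapi/verbs/experienced",
}


def make_statement(rng, student_no, start):
    """Build one synthetic statement for a made-up student (hypothetical helper)."""
    verb = rng.choice(sorted(VERBS))
    # Spread activity over one week from the start time (an arbitrary choice).
    when = start + timedelta(minutes=rng.randrange(0, 7 * 24 * 60))
    return {
        "id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "actor": {
            "objectType": "Agent",
            "name": f"Synthetic Student {student_no}",
            "mbox": f"mailto:student{student_no}@example.org",
        },
        "verb": {"id": VERBS[verb], "display": {"en-US": verb}},
        "object": {
            "objectType": "Activity",
            "id": f"http://example.org/activities/quiz-{rng.randrange(1, 6)}",
        },
        "timestamp": when.isoformat(),
    }


def generate_statements(n_statements, n_students, seed=0):
    """Return a reproducible list of synthetic statements."""
    rng = random.Random(seed)
    start = datetime(2016, 1, 4, tzinfo=timezone.utc)
    return [
        make_statement(rng, rng.randrange(1, n_students + 1), start)
        for _ in range(n_statements)
    ]


if __name__ == "__main__":
    print(json.dumps(generate_statements(2, 5), indent=2))
```

Seeding the random number generator keeps the dataset reproducible, which matters if synthetic data is to be exchanged between developers as a shared reference.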


Dahlstrom, Brooks, and Bichsel report for the American higher education sector that "the average age of 
an LMS is eight years, and 15% of U.S. institutions are planning to replace their LMS within the next three 
years" (2014, p. 3). Although the velocity of replacement of a full LMS is relatively slow, the use of 
standards enables the incremental diversification of feature sets outside the LMS and therefore wider 
diffusion of learner activity. Dahlstrom et al. also note, "User satisfaction is highest for basic LMS features 
and lowest for features designed to foster collaboration and engagement" (2014, p. 4). If user satisfaction 
is the dominant driver, then expect an increasing range of applications used to foster better engagement 
and collaboration. Therefore, the application of popular social media will continue to increase. A synthetic 
data generator will need to be flexible to adapt quickly to the sector-wide incremental evolution of LMS 
services. 

2 http://www.imsglobal.org/IMSLearningAnalyticsWP.pdf 


MOOCs: There are differences between US and European attitudes towards the take-up of MOOCs, with 
European universities planning more significant adoption (Jansen & Schuwer, 2014). There are also 
differences in the social media systems used locally, e.g., Xing (Statista, 2015). Further variations include 
the distribution of surnames across the alphabet (and, one would therefore assume, of login names; ISOGG, 
2015), the language of the content within the MOOCs, and the demographic weighting of the students 
and teachers. All these factors will influence the way that synthetic data is generated. 

MOOCDb (Veeramachaneni & Dernoncourt, 2013) is an MIT project that enables researchers and 
practitioners to share MOOC data in a common format. If European adoption is a significant trend 
influencing the overall use of MOOCs and a representative portion of the activity is shared via MOOCDb, 
then we should use the MOOCDb dataset to shape the synthetic data generator's output. 

Cloud services enable outsourcing of what were traditionally considered core services such as e-mail (e.g., 
Google Mail) and LMSs (e.g., Canvas, Apereo OAE). Bedrossian et al. noted, "The economies of scale, 
resiliency, flexibility and agility provided by cloud computing are rendering the construction and 
maintenance of on-premises data centers obsolete" (2014, p. 2). However, there are significant concerns 
about security in the cloud and potential solutions such as trusted third parties (Zissis & Lekkas, 2012) that 
will impact the availability and practices surrounding activity data. 

Universities are increasingly using federated identity management to share services and enable students 
to learn across organizations. For example, SURF (2015), the Dutch higher education federation, lists over 
sixty services and itself is attached to an overarching hub of federations known as Edugain. 3 As the 
popularity of the federative approach to services widens, organizations will need to share their activity 
data and uniformly apply student consent rules. Synthetic data will allow researchers to simulate the 
impact of adoption of different consent processes. 
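
One way to picture such a simulation is as a small Monte Carlo experiment. The Python sketch below compares the share of a synthetic cohort whose activity data remains usable when consent is requested once and applied uniformly across a federation with the share when each service asks separately. The consent rate, the number of services, the assumption that consent decisions are independent, and the function name are all hypothetical values chosen purely for illustration.

```python
import random


def usable_share(n_students, n_services, consent_rate, uniform, seed=0):
    """Fraction of synthetic students whose data is usable across all services.

    uniform=True  : one consent decision covers every federated service.
    uniform=False : each service asks separately and all must be accepted.
    """
    rng = random.Random(seed)
    usable = 0
    for _ in range(n_students):
        if uniform:
            consented = rng.random() < consent_rate
        else:
            consented = all(rng.random() < consent_rate for _ in range(n_services))
        usable += consented
    return usable / n_students


if __name__ == "__main__":
    for uniform in (True, False):
        share = usable_share(10_000, n_services=3, consent_rate=0.8, uniform=uniform)
        print("uniform consent" if uniform else "per-service consent", round(share, 3))
```

Under these assumptions the per-service process leaves noticeably fewer students with complete data, since every additional consent request is another opportunity to decline. A realistic study would replace the fixed consent rate with behaviour observed for the population in question.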

Devices in general, such as activity trackers, the Internet of Things (Swan, 2012), room occupancy 
sensors, brain computer interfaces (BCI), EEG devices for emotion mapping, house networks, and car 
networks, may play a role in supporting learning. BYOD (Bring Your Own Device) policies at institutions are 
encouraging the use of tablets, smartphones, smart watches, and e-readers with Wi-Fi connectivity, and 
enabling the viewing of content and interaction in different ways to desktop computers. For example, 
third-party apps allow you to monitor your heartbeat through the camera on your smartphone or use it 
as a clicker device. These new applications have the potential to impact course design. The generated data 
can then be fed back into predictive models, which then trigger interventions. This increases the range of 
data sources relevant for learning analytics services. Complexity leads to insecurity and increases the 
risk to the successful implementation of learning analytics services. The complexity and range of interactions 
possible in the scenarios mentioned make it difficult to secure the data. 

3 https://technical.edugain.org/status.php 


Appropriate use of BYOD can improve the grades of students (Cristol & Gimbert, 2013). Thomson (2012) 
noted that we should not focus on issues such as whether to allow people to use their iPads at work. 
Rather, the bigger business challenge is to focus on solutions, that is, on enabling technology for competitive 
advantage. We should take into account the concerns of the consumer (student, teacher, etc.). Lebek, 
Degirmenci, and Breitner (2013) surveyed 151 employees and found that security aspects and the legal 
situation worry employees more than their individual privacy. The implication of these concerns is that 
we should focus our efforts on full end-to-end testing before considering building a sophisticated student 
consent service. 


Hashizume, Rosado, Fernandez-Medina, and Fernandez (2013) identify the main vulnerabilities for cloud 
computing. The list of vulnerabilities and countermeasures should be considered a limited subset of all 
the possible attack vectors. The value of personally identifiable information (PII) is high and, as has been 
seen in recent high-publicity data breaches (e.g., BBC, 2015), significant reputational impact occurs when 
the data is accidentally disclosed. The complex technical infrastructures involved require frequent expert 
testing to minimize the risk of exposure. Synthetic data again can perform a vital role by allowing early 
testing before the systems are fully secured. Furthermore, the data itself can be considered a form of 
documentation; by exchanging synthetic data, developers have more opportunity to validate the 
end-to-end processes of their software and to test its performance. Synthetic data generation is also applicable 
for the multiple new Internet-connected devices that are emerging. Anderson, Kennedy, Ngo, Luckow, 
and Apon (2014) note that research on Internet of Things data can be constrained by concerns about the 
release of privately owned data, and have therefore implemented a synthetic data generator to help 
diminish this issue. 

6 DISCUSSION 

In this paper, we reviewed a university project based on faculty pilots, and a national infrastructure that 
has the potential to become a template for further large-scale projects. We then looked at some of the 
challenges for the sharing of learning experience outside traditional data silos, with the data being spread 
across legal and geographical boundaries. Under these pressures, it is difficult to fully optimize data-driven 
analytics services with a set of real "big" data. We argue for a comprehensive, realistic, shared set of 
synthetic data generated through an easy-to-apply tool. The synthetic data should encompass all systems 
with which the student or teacher interacts. This will enable practitioners to prioritize the data 
requirements and governance around learning analytics services. The tool will empower designers to 
explore a full range of possible services without the barrier of gathering data from multiple and 
idiosyncratic infrastructures. 

For governance processes, a simple solution is to ignore the external data and consume only the data 
from internal data silos. However, the Jisc infrastructure empowers students to incorporate self-declared 
data. This strategy will be put under pressure as external learner activity increases and predictive models 
and associated interventions using this external data are energetically adopted. It is not only the loss of 
control of the activity data that requires careful examination of data governance, but also the increasing 
number of third parties involved, scattered across many geographical locations. These organizations are 
under the authority of a number of legal frameworks driven by different cultures of ethics and privacy. 
The self-declared approach neatly avoids complex policy decisions and supports fine-grained student 
consent. Delegation also avoids a significant degree of central administrative effort. This delegation 
empowers the student to choose to share the data. However, if only a portion of the students within a 
cohort connect their external data, this will cause issues with the coverage of the values returned from 
predictive models. Synthetic data can play a role in replacing the missing data (Baraldi & Enders, 2010); 
for example, replacing missing data with mean values or estimates from regression models. Synthetic data 
can also support simulations to estimate the thresholds set for when the volume of student self-declared 
data is acceptable as an input to student retention systems. 
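
The two replacement strategies named above, mean values and regression estimates, can be sketched in a few lines of Python. In the example, each synthetic student has an internal measure that is always known (LMS logins) and an external measure (hours of external activity) that is only present when the student chooses to share it. The linear relationship between the two, the noise level, the sharing rate, and all function names are assumptions invented for this sketch and are not drawn from Baraldi and Enders (2010) or from real student data.

```python
import random
from statistics import mean


def make_cohort(n_students, share_rate, seed=0):
    """Return (records, truth); records hide the external value for non-sharers."""
    rng = random.Random(seed)
    records, truth = [], []
    for _ in range(n_students):
        lms_logins = rng.uniform(0, 50)  # internal data, always available
        external_hours = 2 + 0.5 * lms_logins + rng.gauss(0, 1)  # invented relationship
        shared = rng.random() < share_rate
        records.append((lms_logins, external_hours if shared else None))
        truth.append(external_hours)
    return records, truth


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def impute(records, method):
    """Fill missing external values using 'mean' or 'regression' imputation."""
    observed = [(x, y) for x, y in records if y is not None]
    if method == "mean":
        fill = mean(y for _, y in observed)
        return [(x, fill if y is None else y) for x, y in records]
    if method == "regression":
        slope, intercept = fit_line([x for x, _ in observed], [y for _, y in observed])
        return [(x, intercept + slope * x if y is None else y) for x, y in records]
    raise ValueError(f"unknown method: {method}")


def mean_abs_error(records, imputed, truth):
    """Average absolute error over the entries that were originally missing."""
    return mean(
        abs(filled[1] - actual)
        for original, filled, actual in zip(records, imputed, truth)
        if original[1] is None
    )


if __name__ == "__main__":
    records, truth = make_cohort(300, share_rate=0.6, seed=42)
    for method in ("mean", "regression"):
        print(method, round(mean_abs_error(records, impute(records, method), truth), 2))
```

Because the synthetic cohort retains the true values, the error of each strategy can be measured directly, which is not possible with real missing data. Regression imputation performs better here only because the generated relationship is linear by construction; varying `share_rate` gives a simple way to explore how much self-declared data is needed before estimates become acceptable.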


Even if we design in well-articulated governance processes, if we cannot secure to a high degree of 
certainty the data within the boundaries of trusted parties, wherever the learning experience takes place, 
then the governance process is flawed. For large service providers (Google, Amazon, Microsoft, etc.) 
individual universities will not be able to exert enough pressure to achieve reasonable data governance 
processes. For a sector-wide, global data governance body that represents the concerns of universities, 
the collective influence over third parties is significantly greater. For example, it could recommend 
standards that government procurement agencies should adhere to, and define sector-wide policies and 
best practices around the full end-to-end process. 


Meanwhile, the more complexity there is, the more testing is required to manage risks and deliver stable 
and secure services. Synthetic data naturally supports data-driven testing. In the medical field, synthetic 
data has been used to generate patient records that collectively simulate the outbreak of infectious 
diseases (Buczak, Babin, & Moniz, 2010), avoiding privacy and anonymization issues. Buczak et al. note 
that there is no consistent set of test data and that only a small number of institutions have a full set of 
data. We argue that the same conditions currently exist within the field of LA. 


Large-scale infrastructures being built for learning analytics services deliver wider opportunities, such as 
academic analytics services focused on the management of institutions. Promising for curriculum design 
is the work at The Open University UK (Rienties, Toetenel, & Bryan, 2015) where individual learning 
trajectories are aggregated to look at learning design patterns. The aggregation across curricula is not 
possible without central control of learner data. Once research leads to services, universities with data 
centralism will have significant advantages, such as the early exploration of a richer and more 
representative set of data, compared to the unconsolidated universities. A broad range of realistic 
synthetic data will enable researchers to design and test their research practices and algorithms, 
enhancing the degree of potential co-operation across an emerging divide. 


Standards such as xAPI and Caliper enable the storage of learning activity data in well-defined formats. 
However, the recipes around how to use those data structures do not yet cover the majority of learning 
scenarios and are not widely adopted. An example of defining relevant recipes is that of Kitto et al. (2015). 
However, this research needs further expansion and adoption of recipes to cover a much greater range 
of situations. The lack of a fully documented and accepted range of recipes risks inconsistent application, 
implying greater effort in consolidating activity datasets, increasing costs, and potentially slowing down 
research projects. Once a range of recipes has been accepted, adaptation of tools such as the simple Apereo 
(2015b) stress test plans will allow for the generation of a wider set of reference datasets. This approach 
would ease the issues mentioned in previous sections, such as accidental disclosure or the inability to test 
complex infrastructures. 

7 CONCLUSION 


The literature review showed that synthetic data generation is widely applied outside the field of learning 
analytics. Because educational data mining and learning analytics research are closely related, synthetic 
methodologies are, to an extent, already embedded within specific learning analytics research methods. 
There is a small set of clear applications within the field, such as performance testing of 
learning record stores and supporting training exercises delivered through MOOCs. 

We have discussed the significant drivers for increasing the richness of learning activity, and hence the 
increasing production of learning activity data. This is due to pressures such as the adoption of online 
teaching methodologies and the increasing range of online services. Meanwhile, many universities are 
expanding the use of analytics services. This, combined with potentially highly rich datasets, is increasing 
the need for synthetic data generation. The complexity of interactions and range of possible data sources, 
combined with the need to avoid accidental disclosure, require a synthetic data generator that is easy to 
extend, simulating a wide range of real datasets. 

The current state of benchmarking for big data, where "workloads currently discussed in the testing and 
benchmarking community do not capture the real complexity of big data" (Alexandrov, Brucke, & Markl, 
2013, p. 1), argues for continued research specifically around the theme of capturing the richness and 
range of the datasets. As a community, we should consider building or adopting an easy-to-use, 
easy-to-extend synthetic data generator that generates realistic learning activity data. As standards-based 
learner activity collection is increasingly adopted within higher education, synthetic xAPI data generation 
will become increasingly necessary. The xAPI recipes mentioned by Kitto et al. (2015) are a starting point 
for a generator. The improvement of the test plans held by the Apereo Foundation is a potential solution 
for a reference implementation. 

The generation of rich datasets for testing learning analytics applications requires coordination across the 
community of researchers and developers in higher education and liaison with vendors. A significant 
opportunity exists to work collaboratively towards generating standards-based synthetic datasets to 
ensure robust, secure, scalable architectures, and valid learning analytics. 